{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "\n",
    "本文主要介绍sklearn中进行特征选择的方法。[sklearn.feature_selection](https://scikit-learn.org/stable/modules/feature_selection.html)模块中的类可用于样本集的特征选择/降维，以提高估计量的准确性得分或提高其在超高维数据集上的性能。\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# 多行输出\r\n",
    "from IPython.core.interactiveshell import InteractiveShell\r\n",
    "InteractiveShell.ast_node_interactivity = \"all\" "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "# 1  删除低方差的特征\n",
    "sklearn中[VarianceThreshold](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold)是特征选择的简单基准方法。它将删除方差未达到某个阈值的所有要素。默认情况下，它将删除所有零方差特征，即在所有样本中具有相同值的特征。\n",
    "其原理很简单，假设某个特征的特征值只有0和1，在所有输入样本中，某个特征百分之95的值都是取相同的数，那就可以认为这个特征作用不大。如果100%都是1，那这个特征就没意义了。当特征值都是离散型变量的时候这种方法才能用，如果是连续型变量，就需要将连续变量离散化之后才能用，在实际当中，很少出现95%以上都取某个值的特征存在，所以这种方法虽然简单但是不太好用。可以把它作为特征选择的预处理，先去掉那些取值变化小的特征，然后再从接下来提到的的特征选择方法中选择合适的进行进一步的特征选择。\n",
    "\n",
    "举个例子，假设我们有一个具有布尔特征的数据集，我们想要删除80%以上的样本中要么是1要么是0的所有特征。布尔特征是伯努利随机变量，这些变量的方差由下式给出：\n",
    "\n",
    "$\\mathrm{Var}[X] = p(1 - p)$\n",
    "\n",
    "因此我们可以使用阈值进行选择：0.8 * (1-0.8)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "原始值为： [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "array([[0, 1],\n",
       "       [1, 0],\n",
       "       [0, 0],\n",
       "       [1, 1],\n",
       "       [1, 0],\n",
       "       [1, 1]])"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.feature_selection import VarianceThreshold\r\n",
    "X = [[0, 0, 1],[0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]\r\n",
    "print(\"原始值为：\",X)\r\n",
    "selector = VarianceThreshold(threshold=(0.8 * (1 - 0.8)))\r\n",
    "# 方差过滤后的值\r\n",
    "# 这一步与selector.fit(X)和selector.transform(X)实现的功能一样\r\n",
    "selector.fit_transform(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "该方法的原理为我们可以计算每列的总体方差，然后给定一个方差阈值，比如0.8 * (1 - 0.8)表示伯努利分布80%数为同一个数时样本的方差。如果某列方差小于0.8 * (1 - 0.8)，表示该列超过80%的数据为同一个数值。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "原始数据每列总体方差为\n",
      "[0.13888889 0.22222222 0.25      ]\n",
      "保留的特征列号为\n",
      "[1 2]\n",
      "反推原始数据列为\n",
      "[[0 0 1]\n",
      " [0 1 0]\n",
      " [0 0 0]\n",
      " [0 1 1]\n",
      " [0 1 0]\n",
      " [0 1 1]]\n"
     ]
    }
   ],
   "source": [
    "print(\"原始数据每列总体方差为\\n{}\".format(selector.variances_))\r\n",
    "print(\"保留的特征列号为\\n{}\".format(selector.get_support(True)))\r\n",
    "print(\"反推原始数据列为\\n{}\".format(selector.inverse_transform(selector.transform(X))))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "我们也可以不设置过滤方差，但是默认阈值为0，也就是100%的数为同一个值的特征才过滤"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[2, 0],\n",
       "       [1, 4],\n",
       "       [1, 1]])"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]\r\n",
    "selector = VarianceThreshold()\r\n",
    "selector.fit_transform(X)\r\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "原始数据每列总体方差为\n",
      "[0.         0.22222222 2.88888889 0.        ]\n",
      "保留的特征列号为\n",
      "[1 2]\n",
      "反推原始数据列为\n",
      "[[0 2 0 0]\n",
      " [0 1 4 0]\n",
      " [0 1 1 0]]\n"
     ]
    }
   ],
   "source": [
    "print(\"原始数据每列总体方差为\\n{}\".format(selector.variances_))\r\n",
    "# 第0列和第3列方差为0需要删除\r\n",
    "print(\"保留的特征列号为\\n{}\".format(selector.get_support(True)))\r\n",
    "print(\"反推原始数据列为\\n{}\".format(selector.inverse_transform(selector.transform(X))))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "# 2 参考\n",
    "> [https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold)\n",
    "\n",
    "> [https://www.cnblogs.com/tszr/p/10802018.html](https://www.cnblogs.com/tszr/p/10802018.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "PaddlePaddle 1.8.0 (Python 3.5)",
   "language": "python",
   "name": "py35-paddle1.2.0"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
